Determining the Gender of the Unseen Name through Hyphenation
Abstract
The accepted method of determining the gender of a name is a probabilistic model built from observations, which cannot classify unseen names. We address this with a hyphenation-driven method that is also more space efficient.

The ability to cross-check several fields within a record is valuable because it lets us validate the information provided. We concentrate here on determining the probable gender of a name so that it can be compared with other gender-related fields. For example, a record with the salutation “Mr.”, the given name “John”, and a gender coded as female may be of questionable value and require additional inspection. A common method uses a probabilistic model built from name observations [1, 2]. This approach cannot provide information on names that have not been observed previously. Alternatives that have been explored include edit distance methods [3] and Soundex matching to identify similar names. In our method, we use the hyphenated form of the name to infer its gender.³

We used a trivial method to generate rules from hyphenated words: we extract the last token and use it as the basis for a probabilistic model (e.g., “elizabeth” hyphenates to “eliz-a-beth”, from which we extract the suffix “beth”). The generated rules can then be applied without hyphenating the names themselves, by matching the right-hand side of the name against the rules. For this experiment, we used the readily available LaTeX hyphenation files for English and German along with an open-source hyphenator [4, 5]. The choice of hyphenation method is not critical, since it only serves as a segmentation method; syllables could also be used, but at a high complexity cost. To validate the method, we used a dataset of name-gender pairs generated from GEDCOM [6] genealogy files, with over 60,000 individuals and more than 5,000 unique names.
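The rule generation and right-hand-side matching described here can be illustrated with a minimal sketch. It assumes the names have already been hyphenated by some external tool. The function names, the toy training pairs, and every hyphenation point other than “eliz-a-beth” are hypothetical and are not taken from the original experiment. Trying the longest matching suffix first is also an assumption, since the text does not say how competing suffix matches are resolved.

```python
from collections import Counter, defaultdict


def build_suffix_rules(hyphenated_names):
    """Map the last hyphenation token of each name to (gender, probability)."""
    counts = defaultdict(Counter)
    for hyphenated, gender in hyphenated_names:
        suffix = hyphenated.lower().split("-")[-1]  # "eliz-a-beth" -> "beth"
        counts[suffix][gender] += 1
    rules = {}
    for suffix, genders in counts.items():
        gender, hits = genders.most_common(1)[0]
        rules[suffix] = (gender, hits / sum(genders.values()))
    return rules


def classify(name, rules):
    """Match the right-hand side of the name against the rules.

    Suffixes are tried from longest to shortest, so the name itself
    never needs to be hyphenated. Returns None if no rule applies.
    """
    key = name.lower()
    for start in range(len(key)):
        suffix = key[start:]
        if suffix in rules:
            return rules[suffix]
    return None


# Toy training data (hypothetical, for illustration only).
training = [
    ("eliz-a-beth", "F"),
    ("jo-beth", "F"),
    ("al-bert", "M"),
    ("her-bert", "M"),
]
rules = build_suffix_rules(training)

print(classify("Annabeth", rules))  # unseen name, matched via the "beth" rule
print(classify("Gilbert", rules))   # unseen name, matched via the "bert" rule
print(classify("Xu", rules))        # no rule applies
```

Because only final tokens are stored, the rule table grows with the number of distinct suffixes rather than the number of distinct names, which is consistent with the space-efficiency claim.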
To avoid cumbersome data cleaning and character set issues, we only processed names consisting of basic US-ASCII characters. Table 1 gives the precision and recall figures for each of the classification models. We found that the hyphenation-driven classification correctly assigned gender in 80% of applicable cases. Interestingly, only about 20,000 names were required for all models to return a consistent performance.⁴

Method                      Precision  Recall  Decision table size (rows)
Name lookup                 85%        87%     5742
Hyphenation lookup          87%        96%     1560
Name + Hyph. fallback       93%        96%     7302
Hyphenation (unseen only)   80%        10%     1560

Table 1. Precision / Recall measures.

The hyphenation model is very efficient: it requires 66% fewer rules than a name lookup model with comparable performance on observed names. Hyphenation classified an additional 10% of the names with high precision. About 3% of names remained unclassifiable. A valuable aspect of using a hyphenation model to identify gender is that the recall histogram of its rules is narrower than that of a basic name model. In situations where space is very limited, such as field data-entry applications, a hyphenation model delivers more value than a standard name lookup model. This novel heuristic for classifying unseen names is computationally inexpensive and allows us to cross-check database records for proper gender identification.

³ The authors wish to thank Dr. Brett Kessler of Washington University in St. Louis for help with this approach.
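The “Name + Hyph. fallback” row can be read as a two-stage lookup: try the full name first, then fall back to the suffix rules. The sketch below illustrates that reading together with one plausible way of computing the two measures. The text does not define them, so the sketch assumes precision is the share of classified names that are correct and recall is the share of all names that receive any classification. All rule tables, names, and labels are made-up toy data, not the GEDCOM dataset.

```python
def classify_with_fallback(name, name_rules, suffix_rules):
    """Look up the full name first; fall back to suffix rules if unseen."""
    key = name.lower()
    if key in name_rules:
        return name_rules[key]
    for start in range(len(key)):  # longest suffix first (an assumption)
        suffix = key[start:]
        if suffix in suffix_rules:
            return suffix_rules[suffix]
    return None  # unclassifiable


def precision_recall(predictions, truth):
    """Assumed definitions: precision = correct / classified,
    recall = classified / all names."""
    classified = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    correct = sum(1 for p, t in classified if p == t)
    precision = correct / len(classified) if classified else 0.0
    recall = len(classified) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy rule tables and evaluation set.
name_rules = {"john": "M", "mary": "F"}
suffix_rules = {"beth": "F", "bert": "M"}
test_set = [
    ("John", "M"),      # seen name
    ("Annabeth", "F"),  # unseen, suffix rule gives the expected label
    ("Gilbert", "M"),   # unseen, suffix rule gives the expected label
    ("Macbeth", "M"),   # unseen, suffix rule gives a different label
    ("Xu", "M"),        # no rule applies
]

predictions = [classify_with_fallback(n, name_rules, suffix_rules) for n, _ in test_set]
truth = [g for _, g in test_set]
precision, recall = precision_recall(predictions, truth)
print(precision, recall)
```

Under these assumed definitions, the fallback can only raise recall relative to name lookup alone, while its effect on precision depends on how reliable the suffix rules are for unseen names.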
Publication date: 2004